As shown in the figure below, the sequence transformation part of the Doge architecture uses Dynamic Mask Attention. It can be understood as self-attention conditioned on the value states during training, and as a state-space model without past-state decay during inference, which addresses the tendency of existing Transformers and SSMs to get lost in long text. The state transformation part of Doge uses Cross Domain Mixture of Experts, which consists of dense linear layers and sparse embedding layers. Sparse parameters can be added on top of a dense weight checkpoint and training continued from there, without retraining the entire model, which reduces the cost of iterating on the model. In addition, Doge uses RMSNorm and Residual with learnable parameters to adapt the gradient range of deep models.
Dynamic Mask Attention Module
Cross Domain Mixture of Experts Module
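To make the last point of the architecture overview concrete, here is a minimal pure-Python sketch of RMSNorm combined with a residual connection that carries its own learnable weight. The function names `rms_norm` and `residual_block`, the per-channel scaling of the skip path, and the pre-norm ordering are assumptions chosen for illustration; they are not taken from the Doge source code.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Divide x by its root mean square, then apply a learnable per-channel weight."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]


def residual_block(x, sublayer, norm_weight, residual_weight):
    """Assumed pre-norm block: residual_weight * x + sublayer(rms_norm(x))."""
    normed = rms_norm(x, norm_weight)
    out = sublayer(normed)
    # The skip path is scaled by a learnable weight instead of being a plain identity.
    return [a * r + o for r, a, o in zip(x, residual_weight, out)]


# Toy usage: a "sublayer" that simply doubles its input.
hidden = [3.0, 4.0]
result = residual_block(
    hidden,
    sublayer=lambda h: [2.0 * v for v in h],
    norm_weight=[1.0, 1.0],
    residual_weight=[1.0, 1.0],
)
print(result)
```

With the residual weight initialised to ones, the block behaves like a standard pre-norm residual; training can then rescale the skip path per channel.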
We also hope to use open-source tools and frameworks as much as possible to simplify the process from data processing to model training, so that beginners can easily understand and use them.
We highly recommend that you install the latest version of PyTorch and CUDA for optimal performance.
Of course, you can also use the open-source Docker PyTorch image to avoid the hassle of configuring the environment.
docker pull nvcr.io/nvidia/pytorch:24.12-py3
docker run --privileged --gpus all -it --name PyTorch --shm-size=32g -p 8888:8888 -p 6006:6006 --ulimit memlock=-1 --ulimit stack=67108864 -v <your code path>:/workspace -v <your datasets path>:/workspace/Doge/datasets nvcr.io/nvidia/pytorch:24.12-py3
- `pip install transformers`: The core framework for all subsequent work.
- `pip install datasets sentencepiece boto3`: Used to download and process datasets.
- `pip install accelerate`: Used for distributed training.
- `pip install trl`: Used for fine-tuning with reinforcement learning.

git clone https://github.com/SamllDoge/SmallDoges.git
cd SmallDoges
pip install -e .
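After installation, a quick check that the packages listed above can be imported may save time later. The sketch below uses only the standard library; the helper name `missing_packages` and the `REQUIRED` list are hypothetical and simply mirror the install steps.

```python
import importlib.util

# Packages named in the installation steps above, plus PyTorch itself.
REQUIRED = [
    "torch",
    "transformers",
    "datasets",
    "sentencepiece",
    "boto3",
    "accelerate",
    "trl",
]


def missing_packages(names=REQUIRED):
    """Return the names that cannot be found on the current import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages()
print("missing packages:", ", ".join(missing) if missing else "none")
```

This only confirms that each package is discoverable; it does not verify versions or CUDA availability.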
We have written a notebook (still being updated) that demonstrates the entire process of dataset processing, model training, and model evaluation. You can use the complete architecture below or its individual modules.
Doge uses wsd_scheduler as the training scheduler, which divides the learning rate schedule into three stages: warmup, stable, and decay. This allows training to continue on any new dataset from any checkpoint in the stable stage without spikes in training.
Here are the initial learning rates required to continue training at each checkpoint:
| Model | Learning Rate | Schedule | Warmup Steps | Stable Steps |
|---|---|---|---|---|
| Doge-20M | 8e-3 | wsd_scheduler | 800 | 6400 |
| Doge-60M | 6e-3 | wsd_scheduler | 1600 | 12800 |
| Doge-160M | 4e-3 | wsd_scheduler | 2400 | 19200 |
| Doge-320M | 2e-3 | wsd_scheduler | 3200 | 25600 |
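The three stages of a warmup-stable-decay schedule can be sketched in a few lines. This is an illustration only, not the repository's wsd_scheduler: the function name `wsd_lr`, the linear warmup and decay shapes, and the decay length of 800 steps are assumptions. The peak learning rate, warmup steps, and stable steps are the Doge-20M values from the table above.

```python
def wsd_lr(step, peak_lr, warmup_steps, stable_steps, decay_steps, final_lr=0.0):
    """Learning rate at `step` for a warmup-stable-decay schedule (linear ramps assumed)."""
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 towards peak_lr.
        return peak_lr * step / warmup_steps
    if step < warmup_steps + stable_steps:
        # Stable: hold the peak learning rate. Checkpoints saved here can be
        # resumed on new data with the same learning rate.
        return peak_lr
    if decay_steps <= 0:
        return final_lr
    # Decay: ramp linearly from peak_lr down to final_lr, then stay there.
    done = step - warmup_steps - stable_steps
    frac = min(done / decay_steps, 1.0)
    return peak_lr + (final_lr - peak_lr) * frac


# Doge-20M values from the table; decay_steps=800 is an assumed value.
for step in (0, 400, 800, 4000, 7200, 7600, 8000):
    print(step, wsd_lr(step, 8e-3, 800, 6400, 800))
```

Because the learning rate is constant throughout the stable stage, resuming from a stable-stage checkpoint does not introduce a sudden change in step size, which is the property the schedule relies on.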
Pre-Training:
| Model | Training Data | Steps | Context Length | Tokens | LR | Batch Size | Precision |
|---|---|---|---|---|---|---|---|
| Doge-20M | HuggingFaceTB/smollm-corpus | 8k | 2048 | 4B | 8e-3 | 0.5M | bfloat16 |
| Doge-60M | HuggingFaceTB/smollm-corpus | 16k | 2048 | 16B | 6e-3 | 1M | bfloat16 |
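The columns of the pre-training table are related by simple arithmetic: total tokens equal steps multiplied by batch size. The sketch below checks this, assuming the batch size is measured in tokens and that "k" and "M" mean 10^3 and 10^6. It also computes the steps left after the warmup and stable stages listed earlier; reading that remainder as the decay stage is an inference, not something stated above. The helper names are hypothetical.

```python
def tokens_seen(steps, batch_tokens):
    """Total training tokens, assuming every step consumes one full batch."""
    return steps * batch_tokens


def remaining_steps(total_steps, warmup_steps, stable_steps):
    """Steps left once the warmup and stable stages are finished."""
    return total_steps - warmup_steps - stable_steps


# (total steps, batch size in tokens, warmup steps, stable steps) from the tables.
runs = {
    "Doge-20M": (8_000, 500_000, 800, 6_400),
    "Doge-60M": (16_000, 1_000_000, 1_600, 12_800),
}

for name, (steps, batch, warmup, stable) in runs.items():
    total = tokens_seen(steps, batch)
    left = remaining_steps(steps, warmup, stable)
    print(f"{name}: {total / 1e9:.0f}B tokens, {left} steps after the stable stage")
```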
Evaluation:
| Model | MMLU | TriviaQA | ARC-E | ARC-C | PIQA | HellaSwag | OBQA | Winogrande | tokens / s on CPU |
|---|---|---|---|---|---|---|---|---|---|
| Doge-20M | 25.43 | 0.03 | 36.83 | 22.78 | 58.38 | 27.25 | 25.60 | 50.20 | 142 |
| Doge-60M | 26.41 | 0.18 | 50.46 | 25.34 | 61.43 | 31.45 | 28.00 | 50.75 | 62 |
All evaluations use the five-shot setting, without additional training on the benchmarks.
SFT:
| Model | Training Data | Epochs | Context Length | LR | Batch Size | Precision |
|---|---|---|---|---|---|---|
| Doge-20M-Instruct-SFT | HuggingFaceTB/smoltalk | 2 | 2048 | 8e-4 | 0.25M | bfloat16 |
| Doge-60M-Instruct-SFT | HuggingFaceTB/smoltalk | 2 | 2048 | 6e-4 | 0.25M | bfloat16 |
DPO:
| Model | Training Data | Epochs | Context Length | LR | Batch Size | Precision |
|---|---|---|---|---|---|---|
| Doge-20M-Instruct | HuggingFaceH4/ultrafeedback_binarized | 2 | 1024 | 8e-5 | 0.125M | bfloat16 |
| Doge-60M-Instruct | HuggingFaceH4/ultrafeedback_binarized | 2 | 1024 | 6e-5 | 0.125M | bfloat16 |
Environment:
If you use this codebase, or otherwise find our work valuable, please cite our paper:
@misc{shi2024wonderfulmatrices,
title={Wonderful Matrices: Combining for a More Efficient and Effective Foundation Model Architecture},
author={Jingze Shi and Bingheng Wu},
year={2024},
eprint={2412.11834},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2412.11834},
}